SMAuC -- The Scientific Multi-Authorship Corpus
With an ever-growing number of new publications each day, scientific writing
poses an interesting domain for authorship analysis of both single-author and
multi-author documents. Unfortunately, most existing corpora lack either
material from the science domain or the required metadata. Hence, we present
SMAuC, a new metadata-rich corpus designed specifically for authorship analysis
in scientific writing. With more than three million publications from various
scientific disciplines, SMAuC is the largest openly available corpus for
authorship analysis to date. It combines a wide and diverse range of scientific
texts from the humanities and natural sciences with rich and curated metadata,
including unique and carefully disambiguated author IDs. We hope SMAuC will
contribute significantly to advancing the field of authorship analysis in the
science domain.
The Archive Query Log: Mining Millions of Search Result Pages of Hundreds of Search Engines from 25 Years of Web Archives
The Archive Query Log (AQL) is a previously unused, comprehensive query log
collected at the Internet Archive over the last 25 years. Its first version
includes 356 million queries, 166 million search result pages, and 1.7 billion
search results across 550 search providers. Although many query logs have been
studied in the literature, the search providers that own them generally do not
publish their logs to protect user privacy and vital business data. Of the few
query logs publicly available, none combines size, scope, and diversity. The
AQL is the first to do so, enabling research on new retrieval models and
(diachronic) search engine analyses. Provided in a privacy-preserving manner,
it promotes open research as well as more transparency and accountability in
the search industry.
Comment: SIGIR 2023 resource paper, 13 pages
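The corpus statistics quoted above already admit some back-of-the-envelope ratios. The following sketch derives them purely from the figures stated in the abstract; the ratios assume a uniform distribution across providers and pages, so they are illustrative only.

```python
# Figures taken verbatim from the AQL abstract (first version of the corpus).
QUERIES = 356_000_000    # archived queries
SERPS = 166_000_000      # search result pages
RESULTS = 1_700_000_000  # individual search results
PROVIDERS = 550          # distinct search providers

# Average number of results per archived result page.
results_per_serp = RESULTS / SERPS

# Average number of archived queries per search provider,
# assuming (unrealistically) an even split across providers.
queries_per_provider = QUERIES / PROVIDERS

print(f"results per SERP:     {results_per_serp:.1f}")
print(f"queries per provider: {queries_per_provider:,.0f}")
```

The roughly ten results per page recovers the classic "ten blue links" layout, which is a quick sanity check that the corpus counts are mutually consistent.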
Evaluating Generative Ad Hoc Information Retrieval
Recent advances in large language models have enabled the development of
viable generative information retrieval systems. A generative retrieval system
returns a grounded generated text in response to an information need instead of
the traditional document ranking. Quantifying the utility of these types of
responses is essential for evaluating generative retrieval systems. As the
established evaluation methodology for ranking-based ad hoc retrieval may seem
unsuitable for generative retrieval, new approaches for reliable, repeatable,
and reproducible experimentation are required. In this paper, we survey the
relevant information retrieval and natural language processing literature,
identify search tasks and system architectures in generative retrieval, develop
a corresponding user model, and study its operationalization. This theoretical
analysis provides a foundation and new insights for the evaluation of
generative ad hoc retrieval systems.
Comment: 14 pages, 5 figures, 1 table
STEREO: Scientific Text Reuse in Open Access Publications
We present the Webis-STEREO-21 dataset, a massive collection of Scientific
Text Reuse in Open-access publications. It contains more than 91 million cases
of reused text passages found in 4.2 million unique open-access publications.
Featuring a high coverage of scientific disciplines and varieties of reuse, as
well as comprehensive metadata to contextualize each case, our dataset
addresses the most salient shortcomings of previous ones on scientific writing.
Webis-STEREO-21 allows for tackling a wide range of research questions from
different scientific backgrounds, facilitating both qualitative and
quantitative analysis of the phenomenon as well as a first-time grounding on
the base rate of text reuse in scientific publications.
Comment: 14 pages, 3 figures, 4 tables
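The two headline figures of the abstract already yield a first, coarse density estimate. The sketch below computes the average number of reuse cases per publication from the stated totals; note this is only a mean, not the base rate itself, which would require the per-publication case counts from the dataset.

```python
# Totals taken verbatim from the Webis-STEREO-21 abstract.
REUSE_CASES = 91_000_000    # cases of reused text passages
PUBLICATIONS = 4_200_000    # unique open-access publications covered

# Mean number of reuse cases per covered publication.
# A proper base-rate estimate would need the full per-document
# distribution, which the dataset itself provides.
cases_per_publication = REUSE_CASES / PUBLICATIONS

print(f"average reuse cases per publication: {cases_per_publication:.1f}")
```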
Shared Tasks as Tutorials: A Methodical Approach
In this paper, we discuss the benefits and challenges of shared tasks as a teaching method. A shared task is a scientific event and a friendly competition to solve a research problem, the task. In terms of linking research and teaching, shared-task-based tutorials fulfill several faculty desires: they leverage students' interdisciplinary and heterogeneous skills, foster teamwork, and engage them in creative work that has the potential to produce original research contributions. Based on ten information retrieval (IR) courses taught with shared tasks as tutorials at two universities since 2019, we derive a domain-neutral process model that captures the structure of such tutorials. Our teaching method has since been adopted by other universities, not only in IR courses but also in other areas of AI such as natural language processing and robotics.